Conversation

Member

@MaxGekk MaxGekk commented Apr 20, 2019

What changes were proposed in this pull request?

In this PR, I propose using the TIMESTAMP_MICROS logical type by default for timestamps written to Parquet files. The type semantically matches Catalyst's TimestampType and stores the number of microseconds since the epoch in the UTC time zone. This avoids converting microseconds to nanoseconds and to Julian days. It also reduces the size of written Parquet files.
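
To make the conversion and size argument concrete, here is a minimal Python sketch. It is not Spark's implementation. It assumes the commonly used legacy INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian), and the function names are hypothetical, chosen only for illustration.

```python
import struct

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_EPOCH = 2440588
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000


def encode_timestamp_micros(micros: int) -> bytes:
    """INT64 / TIMESTAMP_MICROS: store microseconds since the epoch as-is."""
    return struct.pack("<q", micros)


def encode_int96(micros: int) -> bytes:
    """Assumed legacy INT96 layout: nanoseconds of day (8 bytes) + Julian day (4 bytes)."""
    # Floor division keeps pre-epoch timestamps on the correct day.
    days_since_epoch, micros_of_day = divmod(micros, MICROS_PER_DAY)
    julian_day = days_since_epoch + JULIAN_DAY_OF_EPOCH
    nanos_of_day = micros_of_day * 1000  # microseconds -> nanoseconds
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw: bytes) -> int:
    """Reverse the INT96 encoding back to microseconds since the epoch."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1000


if __name__ == "__main__":
    ts = 1_555_718_400_123_456  # an arbitrary example value in microseconds
    print(len(encode_timestamp_micros(ts)), "bytes as INT64")
    print(len(encode_int96(ts)), "bytes as INT96")
```

The INT64 path is a plain copy of the microsecond value, while the INT96 path needs a day split, a Julian day offset, and a unit change in both directions. The raw value also shrinks from 12 to 8 bytes, although the effect on actual file size depends on Parquet encoding and compression.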

How was this patch tested?

By existing test suites.

SparkQA commented Apr 20, 2019

Test build #104775 has finished for PR 24425 at commit 0edfed2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 20, 2019

Test build #104776 has finished for PR 24425 at commit 02a03ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default [SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default Apr 20, 2019
Member

@felixcheung felixcheung left a comment

Could we document this? Changes to the persisted data format are potentially fairly impactful.

SparkQA commented Apr 21, 2019

Test build #104782 has finished for PR 24425 at commit 8d7931a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

Looks good, given that INT96 is deprecated in Parquet too, and migrating apparently looks fine (777b797).

SparkQA commented Apr 22, 2019

Test build #104797 has finished for PR 24425 at commit e0342ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class DescribeQueryCommand(queryText: String, plan: LogicalPlan)

SparkQA commented Apr 22, 2019

Test build #104800 has finished for PR 24425 at commit 25fd404.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@HyukjinKwon
Member

Merged to master.

@MaxGekk MaxGekk deleted the parquet-timestamp_micros branch September 18, 2019 15:58
rshkv added a commit to palantir/spark that referenced this pull request Mar 9, 2021
The upstream default is INT96 but INT96 is considered deprecated by
Parquet [1] and we rely internally on the default being INT64
(TIMESTAMP_MICROS).

INT64 reduces the size of Parquet files and avoids unnecessary conversions
of microseconds to nanoseconds, see [2].

Apache went down the same route in [2] but then reverted to remain
compatible with Hive and Presto in [3].

[1] https://issues.apache.org/jira/browse/PARQUET-323
[2] apache#24425
[3] apache#28450
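
Since the default has moved between INT96 and TIMESTAMP_MICROS (see [2] and [3]), one way to avoid depending on it is to set Spark's `spark.sql.parquet.outputTimestampType` option explicitly. The following is a minimal Python sketch under that assumption. The helper names `timestamp_type_conf` and `as_submit_args` are hypothetical and not part of any Spark API; they only assemble the setting and the matching `spark-submit --conf` arguments.

```python
# Spark SQL option that selects the Parquet type used when writing timestamps.
OUTPUT_TIMESTAMP_TYPE_KEY = "spark.sql.parquet.outputTimestampType"
VALID_OUTPUT_TYPES = ("INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS")


def timestamp_type_conf(value: str = "TIMESTAMP_MICROS") -> dict:
    """Return a one-entry config dict pinning the Parquet timestamp type (hypothetical helper)."""
    if value not in VALID_OUTPUT_TYPES:
        raise ValueError(f"unsupported output timestamp type: {value!r}")
    return {OUTPUT_TIMESTAMP_TYPE_KEY: value}


def as_submit_args(conf: dict) -> list:
    """Render a config dict as spark-submit style '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(as_submit_args(timestamp_type_conf())))
```

Inside a running session, the same key and value could be applied with `spark.conf.set(...)`, which keeps the written format stable regardless of which default a given Spark build ships with.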

5 participants